Switching-Activity Minimization on Instruction-Level Loop Scheduling for VLIW DSP Applications

Authors

  • Zili Shao
  • Qingfeng Zhuge
  • Meilin Liu
  • Bin Xiao
  • Edwin Hsing-Mean Sha
Abstract

This paper develops an instruction-level loop scheduling technique that reduces both execution time and bus switching activities for applications with loops on VLIW architectures. We propose an algorithm, SAMLS (Switching-Activity Minimization Loop Scheduling), to minimize both schedule length and switching activities for such applications. Starting from an initial schedule, the algorithm repeatedly reschedules nodes, using rotation scheduling and bipartite matching to minimize schedule length and switching activities, and selects the best of the schedules generated. The experimental results show that our algorithm can greatly reduce both schedule length and bus switching activities compared with the previous work.

To satisfy ever-growing requirements for high-performance DSP (Digital Signal Processing), the VLIW (Very Long Instruction Word) architecture is widely adopted in high-end DSP processors. A VLIW processor has multiple functional units (FUs) and can process several instructions simultaneously. While this multiple-FU architecture can be exploited to increase instruction-level parallelism and improve time performance, it also increases power consumption. In embedded systems, high-performance DSP must deliver not only high data throughput but also low power consumption. Reducing the power consumption of a DSP application while optimizing its time performance on VLIW processors is therefore an important problem. Since loops are usually the most critical sections of a DSP application and consume a significant amount of power and time, in this paper we address the loop optimization problem and develop an instruction-level loop scheduling technique to minimize both the power consumption and the execution time of an application on VLIW processors. We focus on reducing the power consumption of applications on VLIW architectures by reducing transition activities on the instruction bus.
Due to large capacitance and high transition activities, buses account for a significant fraction of the total power dissipation in a processor. For example, buses in the DEC Alpha 21064 processor dissipate more than 15% of the total power consumption, and buses in the Intel 80386 processor dissipate more than 30% of the total [5]. In this paper, we study the bus switching-activity reduction problem at the compiler level through instruction-level scheduling. Using instruction-level scheduling to reduce bus switching activities can be considered an extension of the low-power bus coding techniques [14, 15] at the compiler level. In a VLIW processor, an instruction word that is fetched onto the instruction bus consists of several instructions. So we can "encode" each long instruction word to reduce bus switching activities by arranging the locations and sequence of the instructions of an application. A VLIW processor usually has a large number of instruction bus wires so that it can fetch several instructions simultaneously. Therefore, we can greatly reduce power consumption by reducing switching activities on the instruction bus.

This work is partially supported by TI University Program, NSF EIA-0103709, Texas ARP 009741-0028-2001, NSF CCR-0309461, USA, and HK POLYU A-PF86 and COMP 4-Z077, HK.

In recent years, researchers have addressed the problem of reducing power consumption through software arrangement at the instruction level [17, 7]. Most work on instruction scheduling for low power focuses on DAG (Directed Acyclic Graph) scheduling. These studies minimize switching activities in the context of different problems, such as register assignment [1] and the I-cache data bus [19]. For VLIW architectures, low-power instruction scheduling techniques have been proposed in [7, 10]. In most of these works, the scheduling techniques are based on traditional list scheduling, in which applications are modeled as a Directed Acyclic Graph and only intra-iteration dependencies are considered.
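To make the notion of bus switching activity concrete, the following C sketch counts bit transitions between consecutive long instruction words. It assumes, as is common in bus power models, that the switching activity between two words is their Hamming distance; the 32-bit encodings, the flat array layout, and the wrap-around from the last word back to the first (to mimic a loop body) are illustrative assumptions, not details taken from the paper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bit positions in which two 32-bit instruction encodings differ. */
unsigned hamming32(uint32_t x, uint32_t y)
{
    uint32_t diff = x ^ y;
    unsigned count = 0;
    while (diff != 0) {
        diff &= diff - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}

/*
 * Total bit transitions on the instruction bus for one pass over a loop body.
 * sched holds steps * slots encodings; sched[s * slots + k] is the
 * (hypothetical) encoding issued at control step s in slot k.
 * Assumption: the last word is followed by the first one again,
 * because the loop body repeats.
 */
unsigned switching_activity(const uint32_t *sched, size_t steps, size_t slots)
{
    unsigned total = 0;
    assert(sched != NULL || steps == 0);
    for (size_t s = 0; s < steps; s++) {
        size_t next = (s + 1) % steps;
        for (size_t k = 0; k < slots; k++)
            total += hamming32(sched[s * slots + k], sched[next * slots + k]);
    }
    return total;
}
```

With two slots and the made-up words A = (0x0, 0xF) and B = (0xF, 0x0), the sequence A, B, A, B costs 32 transitions under this model, while A, A, B, B costs 16. A real scheduler can only choose among orderings that respect data dependencies and resource constraints.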
In this paper, we show that both the power consumption and the time performance of applications with loops on VLIW architectures can be improved significantly by carefully exploiting inter-iteration dependencies. Several loop optimization techniques have been proposed to reduce the power variations of applications. Yun and Kim [20] propose a power-aware modulo scheduling algorithm to reduce both the step power and the peak power for VLIW processors. Yang et al. [18] propose an instruction scheduling compilation technique to minimize power variation in software-pipelined loops. However, a schedule with the minimum power variation is not necessarily the schedule with the minimum total energy consumption, nor the schedule with the minimum length. This paper focuses on developing efficient loop scheduling techniques that reduce both schedule length and switching activities so as to reduce the energy consumption of an application.

Our work is related to [12, 13, 7]. In [12, 13], Shin et al. propose a post-pass optimal operation rearrangement method for VLIW instruction fetch that reduces bus switching activities by converting the problem to the shortest path problem. In [7], Lee et al. propose an instruction scheduling technique that further reduces bus switching activities on VLIW architectures by horizontally rearranging operations using bipartite matching and vertically rescheduling operations using a heuristic algorithm. In these works, applications are represented as a Directed Acyclic Graph (DAG). This paper shows that both bus switching activities and schedule length can be reduced significantly further for applications with loops on VLIW processors. Compared with the technique in [7], which optimizes the DAG part of a loop, our technique shows an average 21.1% reduction in switching activities and an average 14.4% reduction in schedule length. One of our basic ideas is to exploit the inter-iteration dependencies of a loop, which is also known as software pipelining [6, 3].
However, traditional software pipelining, such as rotation scheduling [3], is performance-oriented and does not consider the reduction of switching activities. Therefore, we propose a loop scheduling approach, based on rotation scheduling, that optimizes both the schedule length and the bus switching activities. In our previous work [11], we prove that the loop scheduling problem with minimum latency and minimum switching activities is NP-complete with or without resource constraints, and propose a heuristic algorithm to reduce switching activities and schedule length. That algorithm uses a greedy strategy in which each node in a rotation node set is considered separately and re-scheduled to the best location one by one. An algorithm based on this strategy may not give good results for some applications. Thus, this paper proposes a better algorithm, SAMLS (Switching-Activity Minimization Loop Scheduling), based on rotation scheduling and bipartite matching. In SAMLS, all nodes in a rotation node set are considered together and re-scheduled based on the best matching, obtained by constructing a weighted bipartite matching between nodes and empty locations. We then select the best schedule among those generated from a given initial schedule by repeatedly rescheduling the nodes with schedule length and switching activities minimization. SAMLS can be applied to various VLIW architectures.

We experiment with our algorithm on a set of benchmarks. The experiments are performed on a VLIW architecture similar to that in [7], using real TI C6000 instructions. The experimental results show that our algorithm can greatly reduce both bus switching activities and schedule length compared with the previous work.

The remainder of this paper is organized as follows. In Section 2, we give the basic definitions and concepts used in the rest of the paper. The SAMLS algorithm is presented in Section 3.
Experimental results and concluding remarks are provided in Section 4 and Section 5, respectively.

In this section, we introduce basic models and concepts that will be used in the later sections. We first introduce the target VLIW architecture and cost model. Then we explain how to use a cyclic DFG to model loops. Next we introduce the static schedule and define the switching activities of a schedule. Finally, we introduce the basic concepts of rotation scheduling.

[Figure: target VLIW architecture; recoverable labels: (Mul/Div) FU 1, 32 bits]

[Figure: dot-product example in C, followed by the start of its assembly listing]

    int dotp( short a[ ], short b[ ] )
    {
        int sum, i;
        int sum1 = 0;
        int sum2 = 0;
        for( i = 0; i < 100/2; i += 2 )
        {
            sum1 += a[i] * b[i];
            sum2 += a[i+1] * b[i+1];
        }
        return sum1 + sum2;
    }

    _dotp .cproc a, b
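The matching strategy outlined in the introduction, in which all nodes of a rotation node set are placed together rather than one by one, can be illustrated with a small C sketch. It is only a sketch under stated assumptions: the cost matrix is hypothetical (cost[i * m + j] stands for the switching activity incurred by placing node i in empty location j), and the minimum-weight matching is found by exhaustive search with pruning, which is practical only for small sets and is not the matching algorithm used in the paper.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define MAX_SLOTS 16  /* hypothetical cap so the set of used slots fits in a bitmask */

/* Try every free slot for `node`, then recurse on the remaining nodes. */
static void search(const unsigned *cost, size_t n, size_t m, size_t node,
                   unsigned used, unsigned so_far,
                   size_t *current, unsigned *best, size_t *best_slot)
{
    if (so_far >= *best)
        return;                     /* cannot beat the best matching found so far */
    if (node == n) {                /* every node placed: record the improvement */
        *best = so_far;
        for (size_t i = 0; i < n; i++)
            best_slot[i] = current[i];
        return;
    }
    for (size_t j = 0; j < m; j++) {
        if (used & (1u << j))
            continue;               /* slot j is already taken */
        current[node] = j;
        search(cost, n, m, node + 1, used | (1u << j),
               so_far + cost[node * m + j], current, best, best_slot);
    }
}

/*
 * Assign n nodes to distinct slots out of m so that the summed cost is minimal.
 * Writes the chosen slot of node i to best_slot[i] and returns the total cost,
 * or UINT_MAX if no assignment is attempted (n > m or m > MAX_SLOTS).
 */
unsigned match_nodes_to_slots(const unsigned *cost, size_t n, size_t m,
                              size_t *best_slot)
{
    size_t current[MAX_SLOTS];
    unsigned best = UINT_MAX;
    assert(cost != NULL && best_slot != NULL);
    if (n > m || m > MAX_SLOTS)
        return UINT_MAX;
    search(cost, n, m, 0, 0u, 0u, current, &best, best_slot);
    return best;
}
```

For two nodes with made-up costs {1, 2, 9} and {1, 9, 9} over three empty locations, placing the nodes one by one would give node 0 its cheapest location (location 0, cost 1) and leave node 1 with a cost of 9, for a total of 10. Considering both nodes together yields node 0 in location 1 and node 1 in location 0, for a total of 3.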





Publication date: 2004